Patchwork: A Patch-wise Attention Network for Efficient Object Detection and Segmentation in Video Streams
Recent advances in single-frame object detection and segmentation techniques
have motivated a wide range of works to extend these methods to process video
streams. In this paper, we explore the idea of hard attention aimed at
latency-sensitive applications. Instead of reasoning about every frame
separately, our method selects and only processes a small sub-window of the
frame. Our technique then makes predictions for the full frame based on the
sub-windows from previous frames and the update from the current sub-window.
The latency reduction by this hard attention mechanism comes at the cost of
degraded accuracy. We make two contributions to address this. First, we propose
a specialized memory cell that recovers lost context when processing
sub-windows. Second, we adopt a Q-learning-based policy training strategy
that enables our approach to intelligently select the sub-windows such that the
staleness in the memory hurts performance the least. Our experiments
suggest that our approach reduces latency by approximately a factor of four
without significantly sacrificing accuracy on the ImageNet VID video object
detection dataset and the DAVIS video object segmentation dataset. We further
demonstrate that we can reinvest the saved computation into other parts of the
network, resulting in an accuracy increase at a computational cost comparable
to the original system, and beating other recently proposed state-of-the-art
methods in the low-latency range.
Comment: ICCV 2019 Camera Ready + Supplementary
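For intuition, the control flow might look like the following minimal sketch, assuming a fixed grid of candidate sub-windows and toy stand-in modules (a Q-value head, a GRU as the memory, pooled conv features) that are not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class PatchworkSketch(nn.Module):
    def __init__(self, num_windows=16, feat_dim=128):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_dim, 3, padding=1)  # toy crop encoder
        self.memory = nn.GRUCell(feat_dim, feat_dim)          # stand-in for the specialized memory cell
        self.q_head = nn.Linear(feat_dim, num_windows)        # Q-value per candidate sub-window
        self.det_head = nn.Linear(feat_dim, 4)                # toy full-frame prediction head

    def forward(self, frame, state, windows):
        # Greedy action from the Q-head (epsilon-greedy exploration during training).
        action = self.q_head(state).argmax(dim=-1).item()
        y0, x0, h, w = windows[action]
        crop = frame[:, :, y0:y0 + h, x0:x0 + w]              # only this sub-window is processed
        feat = self.backbone(crop).mean(dim=(2, 3))           # pooled crop features
        state = self.memory(feat, state)                      # refresh the (possibly stale) context
        return self.det_head(state), state

windows = [(y, x, 32, 32) for y in range(0, 128, 32) for x in range(0, 128, 32)]
model = PatchworkSketch(num_windows=len(windows))
state = torch.zeros(1, 128)
for frame in torch.randn(5, 1, 3, 128, 128):                  # 5-frame toy stream
    pred, state = model(frame, state, windows)
```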
FEELVOS: Fast End-to-End Embedding Learning for Video Object Segmentation
Many of the recent successful methods for video object segmentation (VOS) are
overly complicated, rely heavily on fine-tuning on the first frame, and/or are
slow, and are hence of limited practical use. In this work, we propose FEELVOS
as a simple and fast method which does not rely on fine-tuning. In order to
segment a video, for each frame FEELVOS uses a semantic pixel-wise embedding
together with a global and a local matching mechanism to transfer information
from the first frame and from the previous frame of the video to the current
frame. In contrast to previous work, our embedding is used only as internal
guidance for a convolutional network. Our novel dynamic segmentation head allows
us to train the network, including the embedding, end-to-end for the multiple
object segmentation task with a cross-entropy loss. We achieve a new state of
the art in video object segmentation without fine-tuning, with a J&F measure of
71.5% on the DAVIS 2017 validation set. We make our code and models available
at https://github.com/tensorflow/models/tree/master/research/feelvos.
Comment: CVPR 2019 camera-ready version
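A minimal sketch of the global matching step, assuming flattened embedding maps and a nearest-neighbor distance squashed to [0, 1); local matching would simply restrict the reference pixels to a window around each location. The released code linked above is the authoritative implementation:

```python
import torch

def global_matching(curr_emb, ref_emb, ref_mask):
    """curr_emb: (H*W, C) current-frame embeddings; ref_emb: (H*W, C) first-frame
    embeddings; ref_mask: (H*W,) bool mask of one object's first-frame pixels.
    Returns a (H*W,) distance map used as guidance for the segmentation head."""
    obj = ref_emb[ref_mask]                        # embeddings of the object's pixels
    d = torch.cdist(curr_emb, obj)                 # pairwise embedding distances
    nearest = d.min(dim=1).values                  # nearest-neighbor distance per pixel
    return 1.0 - 2.0 / (1.0 + torch.exp(nearest))  # squash to [0, 1)

H, W, C = 16, 16, 32
curr = torch.randn(H * W, C)
ref = torch.randn(H * W, C)
mask = torch.zeros(H * W, dtype=torch.bool)
mask[:40] = True                                   # toy object occupying 40 pixels
dist_map = global_matching(curr, ref, mask).reshape(H, W)
```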
MultiPath: Multiple Probabilistic Anchor Trajectory Hypotheses for Behavior Prediction
Predicting human behavior is a difficult and crucial task required for motion
planning. It is challenging in large part due to the highly uncertain and
multi-modal set of possible outcomes in real-world domains such as autonomous
driving. Beyond predicting a single maximum a posteriori (MAP) trajectory,
obtaining an accurate probability distribution over possible futures is an
area of active interest. We
present MultiPath, which leverages a fixed set of future state-sequence anchors
that correspond to modes of the trajectory distribution. At inference, our
model predicts a discrete distribution over the anchors and, for each anchor,
regresses offsets from anchor waypoints along with uncertainties, yielding a
Gaussian mixture at each time step. Our model is efficient, requiring only one
forward inference pass to obtain multi-modal future distributions, and the
output is parametric, allowing compact communication and analytical
probabilistic queries. We show on several datasets that our model achieves more
accurate predictions, and compared to sampling baselines, does so with an order
of magnitude fewer trajectories.
Comment: Appears in CoRL 2019
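Because the output is parametric, probabilistic queries are analytical. A rough numpy sketch of such a head's outputs, with random placeholder anchors (the paper derives anchors from the data rather than at random):

```python
import numpy as np
from scipy.stats import norm

K, T = 3, 5                                    # number of anchors, timesteps
anchors = np.random.randn(K, T, 2)             # fixed anchor waypoints (x, y)
logits = np.random.randn(K)                    # model output: anchor scores
offsets = 0.1 * np.random.randn(K, T, 2)       # model output: per-waypoint offsets
sigmas = np.full((K, T, 2), 0.5)               # model output: per-waypoint stds

probs = np.exp(logits) / np.exp(logits).sum()  # discrete distribution over anchors
means = anchors + offsets                      # Gaussian component means

def density(xy, t):
    """Analytical mixture density of the agent being at position xy at timestep t."""
    comp = norm.pdf(xy, loc=means[:, t], scale=sigmas[:, t]).prod(axis=-1)
    return float((probs * comp).sum())

print(density(np.array([0.0, 0.0]), t=2))
```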
Pseudo-labeling for Scalable 3D Object Detection
To safely deploy autonomous vehicles, onboard perception systems must work
reliably at high accuracy across a diverse set of environments and geographies.
One of the most common techniques to improve the efficacy of such systems in
new domains involves collecting large labeled datasets, but such datasets can
be extremely costly to obtain, especially if each new deployment geography
requires additional data with expensive 3D bounding box annotations. We
demonstrate that pseudo-labeling for 3D object detection is an effective way to
exploit less expensive and more widely available unlabeled data, and can lead
to performance gains across various architectures, data augmentation
strategies, and sizes of the labeled dataset. Overall, we show that better
teacher models lead to better student models, and that we can distill expensive
teachers into efficient, simple students.
Specifically, we demonstrate that pseudo-label-trained student models can
outperform supervised models trained on 3 to 10 times as many labeled
examples. Using PointPillars [24], a two-year-old architecture, as our student
model, we are able to achieve state of the art accuracy simply by leveraging
large quantities of pseudo-labeled data. Lastly, we show that these student
models generalize better than supervised models to a new domain in which we
only have unlabeled data, making pseudo-label training an effective form of
unsupervised domain adaptation.
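The recipe itself is generic. Here is a toy, runnable illustration with sklearn classifiers standing in for the 3D detectors (the threshold and model choices are illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_l = rng.normal(size=(200, 8)); y_l = (X_l[:, 0] > 0).astype(int)  # small labeled set
X_u = rng.normal(size=(2000, 8))                                     # large unlabeled pool

teacher = RandomForestClassifier(n_estimators=100).fit(X_l, y_l)     # strong teacher
conf = teacher.predict_proba(X_u).max(axis=1)
keep = conf >= 0.8                                                   # confidence filtering
X_p, y_p = X_u[keep], teacher.predict(X_u[keep])                     # pseudo-labels

student = LogisticRegression().fit(                                  # simple, efficient student
    np.vstack([X_l, X_p]), np.concatenate([y_l, y_p]))
```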
SurfelGAN: Synthesizing Realistic Sensor Data for Autonomous Driving
Autonomous driving system development is critically dependent on the ability
to replay complex and diverse traffic scenarios in simulation. In such
scenarios, the ability to accurately simulate the vehicle sensors such as
cameras, lidar or radar is essential. However, current sensor simulators
leverage gaming engines such as Unreal or Unity, requiring manual creation of
environments, objects and material properties. Such approaches have limited
scalability and fail to produce realistic approximations of camera, lidar, and
radar data without significant additional work.
In this paper, we present a simple yet effective approach to generate
realistic scenario sensor data, based only on a limited amount of lidar and
camera data collected by an autonomous vehicle. Our approach uses
texture-mapped surfels to efficiently reconstruct the scene from an initial
vehicle pass or set of passes, preserving rich information about object 3D
geometry and appearance, as well as the scene conditions. We then leverage a
SurfelGAN network to reconstruct realistic camera images for novel positions
and orientations of the self-driving vehicle and moving objects in the scene.
We demonstrate our approach on the Waymo Open Dataset and show that it can
synthesize realistic camera data for simulated scenarios. We also create a
novel dataset that contains cases in which two self-driving vehicles observe
the same scene at the same time. We use this dataset to provide additional
evaluation and demonstrate the usefulness of our SurfelGAN model.
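At a high level, the pipeline is a renderer followed by a conditional GAN. A heavily simplified sketch with toy networks, where the surfel rendering is a random placeholder tensor rather than an actual reconstruction:

```python
import torch
import torch.nn as nn

generator = nn.Sequential(                      # surfel render -> realistic image
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())
discriminator = nn.Sequential(                  # real vs. synthesized
    nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
    nn.Flatten(), nn.LazyLinear(1))

surfel_render = torch.rand(1, 3, 64, 64)        # placeholder for a rendered surfel image
real_image = torch.rand(1, 3, 64, 64)           # paired real camera image

fake = generator(surfel_render)
bce = nn.BCEWithLogitsLoss()
d_loss = (bce(discriminator(real_image), torch.ones(1, 1)) +
          bce(discriminator(fake.detach()), torch.zeros(1, 1)))
g_loss = bce(discriminator(fake), torch.ones(1, 1))  # adversarial term for the generator
```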
To the Point: Efficient 3D Object Detection in the Range Image with Graph Convolution Kernels
3D object detection is vital for many robotics applications. For tasks where
a 2D perspective range image exists, we propose to learn a 3D representation
directly from this range image view. To this end, we designed a 2D
convolutional network architecture that carries the 3D spherical coordinates of
each pixel throughout the network. Its layers can consume an arbitrary
convolution kernel in place of the default inner-product kernel and exploit the
underlying local geometry around each pixel. We outline four such kernels: a
dense kernel according to the bag-of-words paradigm, and three graph kernels
inspired by recent graph neural network advances: the Transformer, the
PointNet, and the Edge Convolution. We also explore cross-modality fusion with
the camera image, facilitated by operating in the perspective range image view.
Our method performs competitively on the Waymo Open Dataset and improves the
state-of-the-art AP for pedestrian detection from 69.7% to 75.5%. It is also
efficient: our smallest model, which still outperforms the popular
PointPillars in quality, requires 180 times fewer FLOPs and model parameters.
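As a sketch of the kernel-swap idea, here is an edge-convolution-style layer over a range image that consumes each pixel's neighbors together with their relative 3D coordinates; the unfold-based gathering and single-layer MLP are illustrative choices, not the paper's implementation:

```python
import torch
import torch.nn as nn

class EdgeConvKernel(nn.Module):
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.k = k
        self.mlp = nn.Linear(c_in + 3, c_out)  # neighbor feature + relative xyz -> output

    def forward(self, feats, xyz):
        """feats: (B, C, H, W) pixel features; xyz: (B, 3, H, W) per-pixel 3D coordinates."""
        pad = self.k // 2
        B, _, H, W = feats.shape
        N = H * W
        f = nn.functional.unfold(feats, self.k, padding=pad).view(B, -1, self.k**2, N)
        p = nn.functional.unfold(xyz, self.k, padding=pad).view(B, 3, self.k**2, N)
        rel = p - xyz.reshape(B, 3, 1, N)                    # neighbor coords relative to the center
        x = torch.cat([f, rel], dim=1).permute(0, 3, 2, 1)   # (B, N, k*k, C+3)
        out = self.mlp(x).max(dim=2).values                  # max-pool over the neighborhood
        return out.permute(0, 2, 1).reshape(B, -1, H, W)

layer = EdgeConvKernel(c_in=16, c_out=32)
y = layer(torch.randn(2, 16, 8, 8), torch.randn(2, 3, 8, 8))  # -> (2, 32, 8, 8)
```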
SoDA: Multi-Object Tracking with Soft Data Association
Robust multi-object tracking (MOT) is a prerequisite for the safe deployment of
self-driving cars. Tracking objects, however, remains a highly challenging
problem, especially in cluttered autonomous driving scenes in which objects
tend to interact with each other in complex ways and frequently get occluded.
We propose a novel approach to MOT that uses attention to compute track
embeddings that encode the spatiotemporal dependencies between observed
objects. This attention measurement encoding allows our model to relax the hard
data associations that may otherwise lead to unrecoverable errors. Instead, our
model aggregates information from all object detections via soft data associations.
The resulting latent space representation allows our model to learn to reason
about occlusions in a holistic data-driven way and maintain track estimates for
objects even when they are occluded. Our experimental results on the Waymo
Open Dataset suggest that our approach leverages modern large-scale datasets and
performs favorably compared to the state of the art in visual multi-object
tracking.
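The core of soft data association can be sketched in a few lines: rather than a hard one-to-one assignment, each track attends over all detections with softmax weights, keeping the update differentiable. The single dot-product attention below is a stand-in for the paper's full attention encoder:

```python
import torch
import torch.nn.functional as F

def soft_associate(tracks, detections):
    """tracks: (M, C) track embeddings; detections: (N, C) detection embeddings.
    Returns (M, C) updates: a differentiable, softly weighted blend of detections."""
    scale = tracks.shape[-1] ** 0.5
    attn = F.softmax(tracks @ detections.T / scale, dim=-1)  # (M, N) soft association weights
    return attn @ detections                                 # aggregated detection evidence

tracks = torch.randn(4, 64)   # 4 ongoing tracks
dets = torch.randn(7, 64)     # 7 detections in the current frame
updates = soft_associate(tracks, dets)
```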
TNT: Target-driveN Trajectory Prediction
Predicting the future behavior of moving agents is essential for real world
applications. It is challenging as the intent of the agent and the
corresponding behavior are unknown and intrinsically multimodal. Our key insight
is that for prediction within a moderate time horizon, the future modes can be
effectively captured by a set of target states. This leads to our target-driven
trajectory prediction (TNT) framework. TNT has three stages which are trained
end-to-end. It first predicts an agent's potential target states T steps into
the future by encoding its interactions with the environment and the other
agents. TNT then generates trajectory state sequences conditioned on targets. A
final stage estimates trajectory likelihoods, and a compact set of trajectory
predictions is selected. This contrasts with previous work, which
models agent intents as latent variables, and relies on test-time sampling to
generate diverse trajectories. We benchmark TNT on trajectory prediction of
vehicles and pedestrians, where we outperform state-of-the-art on Argoverse
Forecasting, INTERACTION, Stanford Drone and an in-house
Pedestrian-at-Intersection dataset.
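A minimal sketch of the three stages with toy linear modules (the candidate targets, dimensions, and top-k sizes are all placeholders; the paper trains the stages end-to-end with richer encoders):

```python
import torch
import torch.nn as nn

C, T, K = 64, 12, 6                      # context dim, horizon, targets kept
ctx = torch.randn(1, C)                  # encoded agent + scene context
cands = torch.randn(50, 2)               # candidate target states (e.g., map-sampled)

target_scorer = nn.Linear(C + 2, 1)      # stage 1: score each candidate target
traj_decoder = nn.Linear(C + 2, T * 2)   # stage 2: trajectory given a target
traj_scorer = nn.Linear(C + T * 2, 1)    # stage 3: likelihood of a full trajectory

x = torch.cat([ctx.expand(len(cands), C), cands], dim=-1)
top = target_scorer(x).squeeze(-1).topk(K).indices           # keep the best targets
trajs = traj_decoder(x[top]).view(K, T, 2)                   # one trajectory per target
scores = traj_scorer(torch.cat([ctx.expand(K, C),
                                trajs.reshape(K, -1)], dim=-1)).squeeze(-1)
final = trajs[scores.topk(3).indices]                        # compact final set
```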
StarNet: Targeted Computation for Object Detection in Point Clouds
Detecting objects from LiDAR point clouds is an important component of
self-driving car technology, as LiDAR provides high-resolution spatial
information. Previous work on point-cloud 3D object detection has re-purposed
convolutional approaches from traditional camera imagery. In this work, we
present an object detection system called StarNet designed specifically to take
advantage of the sparse and 3D nature of point cloud data. StarNet is entirely
point-based, uses no global information, has data dependent anchors, and uses
sampling instead of learned region proposals. We demonstrate how this design
leads to competitive or superior performance on the large Waymo Open Dataset
and the KITTI detection dataset, as compared to convolutional baselines. In
particular, we show how our detector can outperform a competitive baseline on
Pedestrian detection on the Waymo Open Dataset by more than 7 absolute mAP
while being more computationally efficient. We show how our redesign---namely
using only local information and using sampling instead of learned
proposals---leads to a significantly more flexible and adaptable system: we
demonstrate how we can vary the computational cost of a single trained StarNet
without retraining, and how we can target proposals towards areas of interest
with priors and heuristics. Finally, we show how our design allows for
incorporating temporal context by using detections from previous frames to
target computation of the detector, which leads to further improvements in
performance without additional computational cost.
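A rough sketch of the targeted-computation idea: sample proposal centers directly from the cloud and process only local neighborhoods around them. The naive farthest point sampling and radius here are illustrative, and the priors or temporal targeting described above would simply change how centers are chosen:

```python
import torch

def farthest_point_sample(pts, n):
    idx = [0]
    d = torch.full((len(pts),), float('inf'))
    for _ in range(n - 1):
        d = torch.minimum(d, (pts - pts[idx[-1]]).norm(dim=1))
        idx.append(int(d.argmax()))          # next center = farthest remaining point
    return pts[idx]

points = torch.randn(2000, 3)                # LiDAR point cloud
centers = farthest_point_sample(points, 64)  # data-dependent anchors
for c in centers:                            # detector budget = number of centers
    nbr = points[(points - c).norm(dim=1) < 2.0]  # only this local neighborhood is processed
    # a per-neighborhood, PointNet-style classifier/box regressor would run here
```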
Scalability in Perception for Autonomous Driving: Waymo Open Dataset
The research community has increasing interest in autonomous driving
research, despite the resource intensity of obtaining representative real-world
data. Existing self-driving datasets are limited in the scale and variation of
the environments they capture, even though generalization within and between
operating regions is crucial to the overall viability of the technology. In an
effort to help align the research community's contributions with real-world
self-driving problems, we introduce a new large scale, high quality, diverse
dataset. Our new dataset consists of 1150 scenes, each spanning 20 seconds of
well-synchronized and calibrated high-quality LiDAR and camera data captured
across a range of urban and suburban geographies. It is 15x more
diverse than the largest camera+LiDAR dataset available based on our proposed
diversity metric. We exhaustively annotated this data with 2D (camera image)
and 3D (LiDAR) bounding boxes, with consistent identifiers across frames.
Finally, we provide strong baselines for 2D as well as 3D detection and
tracking tasks. We further study the effects of dataset size and generalization
across geographies on 3D detection methods. Find data, code, and more up-to-date
information at http://www.waymo.com/open.
Comment: CVPR 2020
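For reference, reading the released data with the public waymo-open-dataset package looks roughly like the sketch below (based on the public tutorial; the segment filename is a placeholder):

```python
import tensorflow as tf
from waymo_open_dataset import dataset_pb2 as open_dataset

# Each scene is stored as a TFRecord of Frame protos holding the synchronized
# LiDAR and camera data plus the 2D/3D labels.
dataset = tf.data.TFRecordDataset('segment-XXXX.tfrecord', compression_type='')
for data in dataset:
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(data.numpy()))
    print(frame.context.name,       # unique scene identifier
          len(frame.images),        # camera images in this frame
          len(frame.laser_labels))  # 3D bounding box labels
    break
```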